Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix memory visibility error #1936

Merged
merged 2 commits into from
Jun 25, 2019
Merged

Conversation

giant-panda666
Copy link
Contributor

@giant-panda666 giant-panda666 commented Jun 23, 2019

One day, alertmanager could not send alerts in my production. So i made a dump file. The dump file shows that Dispather blocks on cleaning up empty aggrGroups while aggrGroup blocks on ag.done in aggrGroup.stop(). After analysing the code, aggrGroup.stop() should terminate the aggrGroup.run() loop, and then aggrGroup.stop() exits normally. But in fact, it does not work as we hope.

I think that there is a memory visibility error. aggrGroup.run() runs as a goroutine which initializes ag.done, and aggrGroup.stop() runs in another goroutine which reads ag.done channel itself and waiting value from this channel. This is a data race! If those two goroutine runs on different cpu cores, aggrGroup.stop() may be read from a nil channel even though aggrGroup.run() has initialized the channel. So i fix this error by initializing ag.done in newAggrGroup().

There is may be another error. If aggrGroup.stop() runs before aggrGroup.run(), aggrGroup.stop() may be also waiting from a nil channel because ag.done not be initialized in aggrGroup.run(). Of course, it seems that aggrGroup.stop() won't run before aggrGroup.run(). But i don't think that initializing shared variables in one goroutine is a good practice.

giant-panda666 and others added 2 commits June 22, 2019 12:11
Syncing to the original master
@giant-panda666
Copy link
Contributor Author

It's very hard to reproduce this erorr, so i sorry that i cann't make a unit test. All of above is my analysis and this is the only reason that i provide.

@stuartnelson3
Copy link
Contributor

Thanks for the in-depth analysis! I'll give this a read and look through it, and see if I can come up with a unit test.

@simonpasquier
Copy link
Member

simonpasquier commented Jun 24, 2019

The code looks ok. I see no reason why the done channel wasn't initialized in the newAggrGroup() function. And I don't see an easy way to write a unit test for this either.
There might also be a race condition if ag.stop() is called before the first alert is added to the aggregation group? I was wrong since *Dispatcher.processAlert() is protected by a lock and the same for the clean-up section.

Copy link
Contributor

@stuartnelson3 stuartnelson3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if @simonpasquier is happy with it, it's fine by me

@simonpasquier simonpasquier merged commit 5ff6cff into prometheus:master Jun 25, 2019
@simonpasquier
Copy link
Member

Thanks!

DuskEagle pushed a commit to DuskEagle/alertmanager that referenced this pull request Aug 1, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants